The Potential Benefits of Handling Mixture Statistics via a Bi‐Gaussian EnKF: Tests With All‐Sky Satellite Infrared Radiances

Abstract The meteorological characteristics of cloudy atmospheric columns can be very different from those of their clear counterparts. Thus, when a forecast ensemble is uncertain about the presence/absence of clouds at a specific atmospheric column (i.e., some members are clear while others are cloudy), that column's ensemble statistics will contain a mixture of clear and cloudy statistics. Such mixtures are inconsistent with the ensemble data assimilation algorithms currently used in numerical weather prediction. Hence, ensemble data assimilation algorithms that can handle such mixtures can potentially outperform currently used algorithms. In this study, we demonstrate the potential benefits of addressing such mixtures through a bi-Gaussian extension of the ensemble Kalman filter (BGEnKF). The BGEnKF is compared against the commonly used ensemble Kalman filter (EnKF) using perfect-model observing system simulation experiments (OSSEs) with a realistic weather model (the Weather Research and Forecasting model). Synthetic all-sky infrared radiance observations are assimilated in this study. In these OSSEs, the BGEnKF outperforms the EnKF in terms of the horizontal wind components, temperature, specific humidity, and simulated upper-tropospheric water vapor channel infrared brightness temperatures. This study is one of the first to demonstrate the potential of a Gaussian mixture model EnKF with a realistic weather model. Our results thus motivate future research toward improving numerical Earth system predictions through explicitly handling mixture statistics.


Introduction
This document has several purposes. First, we illustrate some differences between clear ensemble statistics and cloudy ensemble statistics; differences like these motivate research into the BGEnKF and similar Gaussian mixture model (GMM) EnKFs. Second, we provide a quick reference for other scientists to understand the BGEnKF, to independently re-create our BGEnKF algorithm, and to support further development of the BGEnKF. To increase the accessibility of this area of research, we have written this document with graduate students in mind.

1. Text S1: Some differences between clear and cloudy member statistics

To set the stage, we plotted maps of the ensemble-averaged Window-BT [Figure S1(b)] and the fraction of cloudy member columns in the ensemble [Figure S1(b)]. These ensemble quantities are constructed from the spun-up 50-member WRF ensemble described in the main text. Though the ensemble captured the general appearance of the organized convective features seen in the nature run [Figures 2(a) and S1(a)], the ensemble was uncertain about the presence/absence of clouds over much of the domain [Figure S1(b)]. This uncertainty is particularly noticeable over regions where the ensemble-averaged Window-BT was between 248 K and 280 K.
Several differences between clear and cloudy member columns can be seen from Figure S1. First, the average Window-BT values of clear member columns are typically warmer than 280 K, whereas the average Window-BT values of cloudy member columns are cooler than 280 K [Figure S1(c & d)]. This difference is well known. As such, the Window-BT ensemble statistics of an ensemble of clear and cloudy member columns (henceforth, a mixed ensemble) will exhibit mixture statistics.
The clear and cloudy member columns also differ noticeably in terms of their humidity fields and the Kalman gain linking Window-BT innovations to humidity increments. For ease of visualization, we examined a column-integrated measure of humidity that is a linear function of the WRF model state: the pseudo precipitable water (PPW). The PPW is defined as

\mathrm{PPW} \equiv \frac{P_{\mathrm{sfc}} - P_{\mathrm{top}}}{g} \int_0^1 q_v \, d\eta, \quad (1)

where q_v refers to the water vapor mass mixing ratio (QVAPOR), P_sfc and P_top refer to the model surface pressure and model top pressure, g is the gravitational acceleration, and η refers to the WRF model's vertical coordinate. The PPW can be derived from the definition of precipitable water by applying the hydrostatic approximation and the definition of WRF η levels, and by assuming that P_sfc and P_top are constants (P_sfc ≡ 1000 hPa, P_top ≡ 20 hPa).
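To make the discretization of Eq. (1) concrete, here is a minimal Python sketch (our own illustration; the function and variable names are not from the PSU-EnKF code, and the layer-interface convention is an assumption):

```python
import numpy as np

def pseudo_precipitable_water(qvapor, eta_interfaces):
    """Discretized Eq. (1) for a single model column.

    qvapor         : water vapor mixing ratio (kg/kg) on N model layers
    eta_interfaces : N+1 WRF eta values bounding those layers, decreasing
                     from 1 (surface) to 0 (model top) -- an assumed layout
    """
    P_SFC, P_TOP = 1000.0e2, 20.0e2    # constant pressures from the text (Pa)
    G = 9.81                           # gravitational acceleration (m s^-2)
    d_eta = -np.diff(eta_interfaces)   # positive layer thicknesses in eta
    # Approximate the integral over eta with a layer-weighted sum.
    return (P_SFC - P_TOP) / G * np.sum(qvapor * d_eta)
```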
We opted to use the linear PPW over precipitable water (PW) because PW is a nonlinear function of the model state. Thus, the Kalman gain linking PW to Window-BT within the same model column is not mathematically equivalent to the column integral of the Kalman gain linking QVAPOR to Window-BT. In contrast, said mathematical equivalence holds for PPW. Examining PPW instead of PW thus allows us to get an accurate sense of what the EnKF would do to QVAPOR within a model column. Figure S1(c & d) indicates that the PPW of cloudy member columns is higher than that of clear member columns. This is because clouds require nearly saturated humidity to materialize. As such, when the ensemble is mixed, mixture statistics in the humidity fields are likely.
We also examined the component of the Kalman gain responsible for propagating Window-BT innovations to QVAPOR: the least squares linear regression coefficient linking Window-BT to QVAPOR (Anderson, 2003). For ease of visualization, we looked at the coefficient linking Window-BT to PPW within the same column. This coefficient (β) is defined as

\beta \equiv \frac{\mathrm{Cov}(\mathrm{PPW}, \mathrm{BT})}{\mathrm{Var}(\mathrm{BT})}, \quad (2)

where Cov(PPW, BT) denotes the prior ensemble covariance between PPW and Window-BT within a given model column, and Var(BT) denotes the prior ensemble variance of Window-BT within the same column. In the limit where the observation error variance is much smaller than Var(BT), the Kalman gain reduces to β.
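For reference, a minimal sketch (ours; illustrative names only) of estimating β from ensemble samples at one model column:

```python
import numpy as np

def regression_coefficient(ppw_ens, bt_ens):
    """Sample estimate of Eq. (2): beta = Cov(PPW, BT) / Var(BT).

    ppw_ens, bt_ens : length-N_E arrays holding each member's PPW and
                      Window-BT at the same model column
    """
    n_e = ppw_ens.size
    ppw_pert = ppw_ens - ppw_ens.mean()    # PPW perturbations
    bt_pert = bt_ens - bt_ens.mean()       # Window-BT perturbations
    cov = np.sum(ppw_pert * bt_pert) / (n_e - 1)
    var = np.sum(bt_pert ** 2) / (n_e - 1)
    return cov / var
```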
As can be seen from Figure S1(e & f), the clear member columns' statistically significant β values are generally an order of magnitude larger than those of the cloudy member columns. This difference suggests that the statistical relationship between Window-BT and humidity can vary dramatically depending on the absence/presence of clouds.
A complication in handling clear members and cloudy members separately lies in the fact that every member usually contains both clear model columns and cloudy model columns. Supposing we have N_i × N_j model columns in the domain, there can exist up to 2^{N_i N_j} possible spatial combinations of clear and cloudy columns in the domain. Sampling these 2^{N_i N_j} combinations would require more than 2^{N_i N_j} ensemble members, which is a likely impractical proposition. Dimensional reduction is necessary to reduce the required number of ensemble members.
A simple and natural dimensional reduction approach is to limit our clear/cloudy considerations to small regions of the domain. This approach is effectively a type of spatial localization, a commonly employed heuristic for limiting the effects of sampling errors on EnKFs (Houtekamer & Zhang, 2016). As a first attempt at employing this localization, suppose we are assimilating observations one at a time (i.e., serial assimilation). When assimilating the m-th observation, we will only consider model columns within 1 horizontal radius of influence (HROI) of the observed column. If there are N_loc columns within 1 HROI of the observed column, the number of possible spatial combinations falls from 2^{N_i N_j} to 2^{N_loc}. For commonly used HROI values, 2^{N_loc} ≪ 2^{N_i N_j}.
Though localization can dramatically reduce the number of spatial clear/cloudy combinations, 2^{N_loc} is likely greater than the number of ensemble members N_E. For instance, in the IR DA experiments of Chan, Zhang, et al. (2020) and ? (?), the HROI is approximately 11 model grid spacings (100-km HROI, 9-km grid spacing), meaning that there exist approximately π × 11² ≈ 380 model columns within the localization zone. A typical ensemble size of ∼50 is much smaller than the number of spatial combinations in this example (∼2^{380}). Another measure is necessary to further simplify the problem.
We opted to assume that there are at most two clear/cloudy spatial combinations within the localized zone. To understand the rationale, consider that localized serial EnKFs assume that all ensemble members within 1 radius of influence (ROI) of an observation are drawn from a Gaussian distribution (Burgers et al., 1998; Whitaker & Hamill, 2002; Anderson, 2003). This is equivalent to assuming that there exists only one spatial combination within 1 ROI of said observation. Our two-combination assumption, though imperfect, is closer to the actual number of spatial combinations (2^{N_loc}) than the one-combination assumption.
We can now consider that the ensemble members are drawn from a mixture of two distributions within the localized region. The EnKF can be extended to handle this mixture distribution by replacing the EnKF's Gaussian prior assumption. Specifically, we consider that some prior members are drawn from one Gaussian distribution and the other members are drawn from a different Gaussian distribution. The prior ensemble is thus assumed to be drawn from a bi-Gaussian prior distribution. The resulting algorithm will be henceforth termed the bi-Gaussian EnKF (BGEnKF).
For the BGEnKF to work, it is necessary to separate the ensemble members into two groups (henceforth termed "clusters"). The sample statistics of each cluster will correspond to one of two Gaussian kernels. As a first approach, we will consider members that are clear at the observation site to be drawn from one Gaussian distribution (henceforth termed the "clear kernel" or "clear cluster"). The remaining members will be considered to be drawn from a different Gaussian distribution (henceforth, the "cloudy kernel" or "cloudy cluster"). More advanced clustering approaches, such as those involving machine learning (e.g., support vector machines), can be considered at a later date.
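As a concrete illustration of this first clustering approach, here is a minimal sketch (ours; the ξ threshold shown is a placeholder, not the value used in the study):

```python
import numpy as np

def split_into_clusters(xi_at_obs, xi_threshold=1.0e-4):
    """Sort member indices into clear and cloudy clusters at one obs site.

    xi_at_obs    : length-N_E array of each member's column-integrated
                   frozen water mass content (xi) at the observed column
    xi_threshold : cutoff separating clear from cloudy member columns
                   (placeholder value)
    Returns (S_clr, S_cld): the index sets of the two clusters.
    """
    cloudy = xi_at_obs > xi_threshold
    return np.where(~cloudy)[0], np.where(cloudy)[0]
```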

3. Text S3: Bayes' rule for the BGEnKF

We will now formulate a serially assimilating BGEnKF (i.e., an algorithm that assimilates one observation at a time) starting from Bayes' rule and using a notation akin to that of Ide, Courtier, Ghil, and Lorenc (1997). In our earlier study (Chan, Anderson, & Chen, 2020), the BGEnKF was formulated as a model state space filter [or, in the terminology of Anderson and Collins (2007), a sequential filter]. However, multi-process implementations of sequential filters require inter-process communications at every iteration of the serial assimilation loop. The sequential filter formulation thus does not scale well with parallelization (Anderson & Collins, 2007).
To ensure that the BGEnKF algorithm scales well with parallelization, the BGEnKF is formulated to constrain an extended state vector ψ (Anderson & Collins, 2007). ψ will contain all of the variables used in the BGEnKF. Aside from containing the model state x, ψ will also contain the simulated observation values y that correspond to said model state. Furthermore, since ξ [column-integrated frozen water mass content; see main text's Eq. (1)] can be used to discriminate clear column members from cloudy column members (see main text's section 2.2), we will also include the ξ value at every observation site in ψ; the vector ξ will denote these values. We can thus define

\psi \equiv \begin{bmatrix} x \\ y \\ \xi \end{bmatrix}. \quad (3)

Supposing N_x denotes the number of elements in x and N_y denotes the number of elements in y (and in ξ), ψ has N_x + 2N_y elements. For ease of writing, we will define the n-th member's forecasted extended state as

\psi^f_n \equiv \begin{bmatrix} x^f_n \\ h(x^f_n) \\ \xi(x^f_n) \end{bmatrix}. \quad (4)

Here, h(x^f_n) represents applying the observation operator h to x^f_n, and ξ(x^f_n) represents evaluating ξ [Eq. (1) of the main text] at every observation site using the information in x^f_n. Since the BGEnKF will be formulated as a serial assimilation algorithm, we can outline the essence of the algorithm by considering what happens when a single observation (y^o) is assimilated into an ensemble of forecasted ψ vectors. Like typical serially assimilating EnKF algorithms [e.g., Whitaker and Hamill (2002), Anderson et al. (2009), and Meng and Zhang (2007)], the serially assimilating BGEnKF algorithm is of the form (a sketch of assembling ψ appears after this outline):
1. Select an unassimilated observation.
2. Apply the heuristic checks of the main text's section 2.5 to decide whether to assimilate the observation with the EnKF or the BGEnKF.
3. Divide the ensemble into the clear and cloudy clusters using the procedure described in the main text's section 2.2.
4. Update the ensemble of extended states using the selected filter.
5. Return to step 1 until every observation has been assimilated.
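The following minimal sketch (ours; h and xi are stand-in callables, not the PSU-EnKF interfaces) shows how one member's extended state could be assembled:

```python
import numpy as np

def build_extended_state(x_f, h, xi):
    """Assemble one member's extended state [Eqs. (3) and (4)].

    x_f : length-N_x model state of member n
    h   : callable mapping x to the N_y simulated observations, h(x)
    xi  : callable evaluating column-integrated frozen water mass content
          at each of the N_y observation sites from x
    Returns a length-(N_x + 2 N_y) vector psi = [x, h(x), xi(x)].
    """
    return np.concatenate([x_f, h(x_f), xi(x_f)])
```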
We will thus formulate the BGEnKF equations by considering the assimilation of a single observation y^o. Supposing that the ensemble members have been sorted into the clear and cloudy clusters based on the ξ value at the observation site, the BGEnKF assumes that the prior probability density function [pdf; p(ψ)] can be represented by the bi-Gaussian pdf

p(\psi) = w^f_{\mathrm{clr}} \, G(\psi; \bar{\psi}^f_{\mathrm{clr}}, P^f_{\mathrm{clr}}) + w^f_{\mathrm{cld}} \, G(\psi; \bar{\psi}^f_{\mathrm{cld}}, P^f_{\mathrm{cld}}). \quad (5)

Throughout this document, we will use the subscript "clr" to denote clear cluster quantities and the subscript "cld" to denote cloudy cluster quantities. G(ψ; ψ̄^f_clr, P^f_clr) denotes the clear cluster's Gaussian kernel with mean state ψ̄^f_clr and covariance matrix P^f_clr. Similarly, G(ψ; ψ̄^f_cld, P^f_cld) denotes the cloudy cluster's Gaussian kernel with mean state ψ̄^f_cld and covariance matrix P^f_cld. In general, the Gaussian pdf for a K-dimensional state vector ψ with some mean µ and covariance matrix C is defined as

G(\psi; \mu, C) \equiv (2\pi)^{-K/2} \, |C|^{-1/2} \exp\left[ -\frac{1}{2} (\psi - \mu)^{\mathrm{T}} C^{-1} (\psi - \mu) \right]. \quad (6)

The scalar quantities w^f_clr and w^f_cld are the respective weights of the clear and cloudy Gaussian kernels. Note that w^f_clr + w^f_cld = 1.

The various parameters in the prior pdf [Eq. (5)] are estimated from the clusters' sample statistics. Let S_clr and S_cld denote the sets of member indices in the clear and cloudy clusters. Supposing g is a placeholder that can be replaced with "clr" or "cld", count(S_g) counts the number of elements in the set S_g. The parameters of Eq. (5) can then be estimated via

w^f_g = \frac{\mathrm{count}(S_g)}{N_E}, \qquad \bar{\psi}^f_g = \frac{1}{\mathrm{count}(S_g)} \sum_{n \in S_g} \psi^f_n. \quad (7)

Note that the BGEnKF does not require any explicit estimate of the large matrices P^f_cld and P^f_clr. Instead, like the typical serially assimilating EnKF, the BGEnKF only requires calculating a column of these matrices. This will be discussed in Text S4.
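A minimal sketch (ours) of estimating the kernel weights and means in Eq. (7) from a clustered extended-state ensemble:

```python
import numpy as np

def prior_kernel_parameters(psi_ens, s_clr, s_cld):
    """Estimate Eq. (7) for both clusters.

    psi_ens      : (N_E, N_psi) array; row n holds member n's extended state
    s_clr, s_cld : integer index arrays from the clustering step
    The full covariance matrices are never formed; the single covariance
    column needed later is computed directly from perturbations (Text S4).
    """
    n_e = psi_ens.shape[0]
    weights = {"clr": s_clr.size / n_e, "cld": s_cld.size / n_e}
    means = {"clr": psi_ens[s_clr].mean(axis=0),
             "cld": psi_ens[s_cld].mean(axis=0)}
    return weights, means
```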
To assimilate y^o into ψ^f_1, ..., ψ^f_{N_E}, consider Bayes' rule:

p(\psi \,|\, y^o) = \frac{p(y^o \,|\, \psi) \, p(\psi)}{p(y^o)}, \quad (8)

where the marginal p(y^o) normalizes the numerator of Eq. (8) [e.g., Lorenc (1986)]. As we will show later, this normalization property is central to deriving the posterior weights of the clear and cloudy posterior kernels. Note that though the normalization property is used in the derivation, there is no need to explicitly compute p(y^o) at any point in the BGEnKF algorithm.
If we assume Gaussian observation errors, the observation likelihood p(y^o|ψ) can be written as

p(y^o \,|\, \psi) = G(H\psi; y^o, \sigma_o^2), \quad (9)

where σ_o² is the observation error variance and H is a matrix that extracts the simulated observation from ψ. Before proceeding further, note that the observation likelihoods for IR-BTs are not strictly Gaussian. The associated observation errors are known to depend on the presence/absence of clouds in the observed atmospheric columns (Geer & Bauer, 2011; Harnisch et al., 2016; Otkin et al., 2018). Furthermore, IR-BT values are bounded. Nonetheless, the successes seen in assimilating IR-BTs with EnKFs suggest that the imperfect Gaussian observation likelihood assumption is at least somewhat functional (Otkin, 2012; Honda et al., 2018; Minamide & Zhang, 2018; Y. Zhang et al., 2018; Otkin & Potthast, 2019; F. Zhang et al., 2019; Geer et al., 2019; Chan, Zhang, et al., 2020; Jones et al., 2020; ?, ?; Hartman et al., 2021; Y. Zhang et al., 2021). We will thus proceed with the assumption that the observation likelihood is Gaussian.
For ease of future reference, we will sketch out the main steps to derive the posterior pdf. Combining the bi-Gaussian forecast pdf [Eq. (5)] with the observation likelihood [Eq. (9)] through Bayes' rule [Eq. (8)] results in

p(\psi \,|\, y^o) = \frac{ w^f_{\mathrm{clr}} \, G(\psi; \bar{\psi}^f_{\mathrm{clr}}, P^f_{\mathrm{clr}}) \, G(H\psi; y^o, \sigma_o^2) + w^f_{\mathrm{cld}} \, G(\psi; \bar{\psi}^f_{\mathrm{cld}}, P^f_{\mathrm{cld}}) \, G(H\psi; y^o, \sigma_o^2) }{ p(y^o) }. \quad (10)

To proceed further, a well-known property is used: the multiplication of two Gaussian pdfs results in a scaled Gaussian pdf. This property is foundational to EnKFs (Evensen, 1994; Burgers et al., 1998; Houtekamer & Mitchell, 2001; Anderson, 2001; Bishop et al., 2001; Whitaker & Hamill, 2002; Tippett et al., 2003; Hunt et al., 2007). In this situation, for the term associated with cluster g [e.g., Anderson and Anderson (1999)],

G(\psi; \bar{\psi}^f_g, P^f_g) \, G(H\psi; y^o, \sigma_o^2) = \alpha_g \, G(\psi; \bar{\psi}^a_g, P^a_g), \quad (11)

where ψ̄^a_g represents the analyzed mean state of cluster g, P^a_g represents the analyzed covariance matrix of said cluster, and α_g is a scaling factor. ψ̄^a_g and P^a_g are related to ψ̄^f_g and P^f_g via the Kalman filter (KF) equations [e.g., Lorenc (1986)]

\bar{\psi}^a_g = \bar{\psi}^f_g + K_g \left( y^o - H\bar{\psi}^f_g \right), \qquad P^a_g = (I - K_g H) P^f_g, \quad (12)

where K_g is the Kalman gain matrix for cluster g. K_g can be computed via

K_g = \frac{\mathrm{Cov}\left( \psi^f_g, H\psi^f_g \right)}{\mathrm{Var}\left( H\psi^f_g \right) + \sigma_o^2}, \quad (13)

where

\mathrm{Cov}\left( \psi^f_g, H\psi^f_g \right) \equiv \frac{1}{\mathrm{count}(S_g) - 1} \sum_{n_g \in S_g} \left( \psi^f_{n_g} - \bar{\psi}^f_g \right) \left( H\psi^f_{n_g} - H\bar{\psi}^f_g \right),
\qquad
\mathrm{Var}\left( H\psi^f_g \right) \equiv \frac{1}{\mathrm{count}(S_g) - 1} \sum_{n_g \in S_g} \left( H\psi^f_{n_g} - H\bar{\psi}^f_g \right)^2,

and n_g is a dummy index that iterates over the member indices contained in S_g. The scaling factor α_g in Eq. (11) can be shown to be [e.g., Anderson and Anderson (1999)]

\alpha_g = G\left( H\bar{\psi}^f_g; \, y^o, \, \mathrm{Var}\left( H\psi^f_g \right) + \sigma_o^2 \right). \quad (14)

Note that Hψ^f_n, Hψ̄^f_g, and Var(Hψ^f_g) are scalars. Furthermore, if y^o corresponds to the (N_x + m)-th element of ψ, then Cov(ψ^f_g, Hψ^f_g) is equal to the (N_x + m)-th column of P^f_g.
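To make Eqs. (13) and (14) concrete, here is a minimal sketch (ours; the equation numbers refer to the reconstruction above) that forms the required covariance column directly from perturbations:

```python
import numpy as np

def gain_and_alpha(psi_g, obs_idx, y_o, sigma_o2):
    """Kalman gain column [Eq. (13)] and scaling factor [Eq. (14)] for one
    cluster when assimilating a single observation.

    psi_g    : (N_g, N_psi) array of the cluster's extended states
    obs_idx  : position of the assimilated simulated observation in psi,
               i.e., the (N_x + m)-th element for the m-th observation
    y_o      : the observation value
    sigma_o2 : observation error variance
    """
    n_g = psi_g.shape[0]
    mean = psi_g.mean(axis=0)
    pert = psi_g - mean                          # extended-state perturbations
    h_pert = pert[:, obs_idx]                    # simulated-obs perturbations
    cov_col = pert.T @ h_pert / (n_g - 1)        # one column of P_f_g
    var_h = np.dot(h_pert, h_pert) / (n_g - 1)   # Var(H psi_f_g)
    gain = cov_col / (var_h + sigma_o2)          # Eq. (13)
    s2 = var_h + sigma_o2
    # Scalar Gaussian density of the innovation, Eq. (14).
    alpha = np.exp(-0.5 * (y_o - mean[obs_idx]) ** 2 / s2) / np.sqrt(2 * np.pi * s2)
    return gain, alpha, var_h
```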
Substituting Eq. (11) into Eq. (10) results in

p(\psi \,|\, y^o) = \frac{ w^f_{\mathrm{clr}} \alpha_{\mathrm{clr}} \, G(\psi; \bar{\psi}^a_{\mathrm{clr}}, P^a_{\mathrm{clr}}) + w^f_{\mathrm{cld}} \alpha_{\mathrm{cld}} \, G(\psi; \bar{\psi}^a_{\mathrm{cld}}, P^a_{\mathrm{cld}}) }{ p(y^o) }. \quad (15)

Since p(y^o) normalizes Eq. (15),

p(y^o) = w^f_{\mathrm{clr}} \alpha_{\mathrm{clr}} + w^f_{\mathrm{cld}} \alpha_{\mathrm{cld}}, \quad (16)

and the posterior pdf can be written as

p(\psi \,|\, y^o) = w^a_{\mathrm{clr}} \, G(\psi; \bar{\psi}^a_{\mathrm{clr}}, P^a_{\mathrm{clr}}) + w^a_{\mathrm{cld}} \, G(\psi; \bar{\psi}^a_{\mathrm{cld}}, P^a_{\mathrm{cld}}), \quad (17)

where

w^a_g \equiv \frac{w^f_g \alpha_g}{w^f_{\mathrm{clr}} \alpha_{\mathrm{clr}} + w^f_{\mathrm{cld}} \alpha_{\mathrm{cld}}}. \quad (18)

Like the EnKF, the BGEnKF will update the forecast ensemble to become consistent with the posterior bi-Gaussian pdf [Eq. (17)].
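A sketch of Eq. (18) (ours), using the α_g values from the previous step:

```python
def posterior_weights(w_f, alpha):
    """Posterior kernel weights [Eq. (18)]: w_a_g is proportional to
    w_f_g * alpha_g, normalized so the two weights sum to one.

    w_f, alpha : dicts keyed by "clr" and "cld"
    """
    raw = {g: w_f[g] * alpha[g] for g in ("clr", "cld")}
    total = sum(raw.values())    # equals p(y_o) in Eq. (16)
    return {g: raw[g] / total for g in raw}
```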

4. Text S4: Detailed description of the three-stage BGEnKF algorithm

The BGEnKF's updates to the ensemble are made through a three-stage update process (illustrated in the main text's Figure 1). In order of execution, these stages are: 1) the double EnKF stage, 2) the shrinking cluster member deletion stage, and 3) the expanding cluster member resampling stage. An outline of this three-stage BGEnKF update procedure can be found at the end of this section.

The double EnKF stage
The first stage [Figure 1(a)] represents the KF updates to the clusters' mean states and covariance matrices [Eq. (12)]. We can thus use the ensemble square root filter (EnSRF) of Whitaker and Hamill (2002) to update each cluster's members. The EnSRF update equation (Whitaker & Hamill, 2002) for members in cluster g is

\psi^a_{n_g} = \bar{\psi}^a_g + \left( \psi^f_{n_g} - \bar{\psi}^f_g \right) - \phi_g K_g H \left( \psi^f_{n_g} - \bar{\psi}^f_g \right), \qquad n_g \in S_g. \quad (19)

The Kalman gain matrix of cluster g (K_g) can be computed via Eq. (13). ϕ_g is the EnSRF's square-root modification factor (Whitaker & Hamill, 2002), which can be computed via

\phi_g = \left[ 1 + \sqrt{ \frac{\sigma_o^2}{\mathrm{Var}\left( H\psi^f_g \right) + \sigma_o^2} } \right]^{-1}. \quad (20)

Note that the EnSRF-based cluster update equations can be replaced with those from the two-step ensemble adjustment Kalman filter (EAKF) of Anderson (2003). This is because the two filters have mathematically identical ensemble member update procedures.
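A minimal sketch (ours) of this per-cluster EnSRF update, reusing the gain and variance from the earlier step:

```python
import numpy as np

def ensrf_update_cluster(psi_g, gain, var_h, sigma_o2, y_o, obs_idx):
    """Update one cluster's members with Eqs. (12), (19), and (20).

    psi_g : (N_g, N_psi) array of the cluster's forecast extended states
    gain  : Kalman gain column for this cluster [Eq. (13)]
    var_h : prior variance of the simulated observation, Var(H psi_f_g)
    """
    mean = psi_g.mean(axis=0)
    pert = psi_g - mean
    # Square-root modification factor, Eq. (20).
    phi = 1.0 / (1.0 + np.sqrt(sigma_o2 / (var_h + sigma_o2)))
    mean_a = mean + gain * (y_o - mean[obs_idx])            # Eq. (12), mean
    pert_a = pert - phi * np.outer(pert[:, obs_idx], gain)  # Eq. (19), perts
    return mean_a + pert_a
```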

The member deletion stage
In the second and third stages of the BGEnKF [Figure 1(b & c)], the number of ensemble members in each cluster (i.e., the cluster sizes) is updated to be consistent with the clusters' posterior weights [Eq. (18)]. The post-BGEnKF size of cluster g (N^a_g) can be determined by

N^a_g = \mathrm{round}\left( w^a_g N_E \right), \quad (21)

where round(·) indicates rounding to the nearest integer.
If the size of a cluster is reduced by the assimilation of y^o, we will delete members from said cluster [Figure 1(b)]. The number of members to be deleted, N_del, is defined as

N_{\mathrm{del}} \equiv \mathrm{count}(S_g) - N^a_g, \quad (22)

where g refers to the shrinking cluster. For simplicity, we will delete the N_del members whose forecast-simulated observation perturbations are the smallest in magnitude. Since the deletion will cause the cluster's mean state to deviate from the theoretical mean state [Eq. (12)], we will recenter the remaining members around the theoretical value. Note that no heuristic adjustments were made to mitigate the changes in the cluster's sample covariance matrix due to the deletion process. This is because it is impossible to prevent such changes in practical situations [for N_E < N_ψ, the rank of the pre-deletion sample covariance matrix is guaranteed to be higher than the rank of the post-deletion sample covariance matrix; Chan, Anderson, and Chen (2020)].
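A sketch (ours; reading the deletion rule as smallest-magnitude perturbations is our interpretation) of the deletion-and-recenter step:

```python
import numpy as np

def delete_and_recenter(psi_shrink, obs_idx, n_del):
    """Member deletion stage: drop the n_del members whose simulated-obs
    perturbations are smallest in magnitude, then recenter the survivors
    on the pre-deletion (theoretical) cluster mean."""
    mean_before = psi_shrink.mean(axis=0)
    h_pert_mag = np.abs(psi_shrink[:, obs_idx] - mean_before[obs_idx])
    keep = np.argsort(h_pert_mag)[n_del:]       # indices of the survivors
    survivors = psi_shrink[keep]
    # Recentering leaves the cluster mean unchanged by the deletion.
    return survivors - survivors.mean(axis=0) + mean_before, keep
```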

The resampling stage
If the size of one cluster is reduced by the assimilation of y^o, the other cluster's size will increase to compensate for the reduction. This ensures that the total number of ensemble members is unchanged. To grow the expanding cluster, its ensemble members are resampled. The resampling should not alter the expanding cluster's sample mean state and sample covariance matrix.
The computationally efficient resampling strategy proposed in Chan, Anderson, and Chen (2020) is to resample within the extended state subspace spanned by the expanding cluster's ensemble members (henceforth referred to as the subspace resampling strategy). This strategy is easiest to formulate in terms of the perturbations of the expanding cluster's members. Supposing that the subscript "pre" denotes expanding cluster quantities before resampling, we can compute the pre-resampling perturbations {ψ^{a′}_n | n ∈ S_pre} via

\psi^{a\prime}_n \equiv \psi^a_n - \bar{\psi}^a_{\mathrm{pre}}, \qquad \forall\, n \in S_{\mathrm{pre}}, \quad (23)

where ψ̄^a_pre is the expanding cluster's mean state and S_pre is the set of member indices in the expanding cluster before resampling.
The central idea of the subspace resampling strategy is to construct a new set of perturbations via linear combinations of the pre-resampling perturbations. We will denote all post-resampling expanding cluster quantities with the subscript "post". Let S_post denote the set of member indices in the post-resampling expanding cluster. S_post thus includes the member indices in S_pre and the indices of the members deleted in the deletion stage. If we represent the set of post-resampling perturbation vectors as {ψ^{a*}_{n*} | n* ∈ S_post}, the strategy's central idea can then be mathematically expressed as

\psi^{a*}_{n^*} \equiv \sum_{n \in S_{\mathrm{pre}}} \psi^{a\prime}_n \, T_{n,n^*}, \qquad \forall\, n^* \in S_{\mathrm{post}}, \quad (24)

where T_{n,n*} is a to-be-determined scalar factor controlling how the n-th pre-resampling perturbation contributes to the n*-th post-resampling perturbation. This linear combination idea can be more succinctly expressed as

\Psi_{\mathrm{post}} = \Psi_{\mathrm{pre}} T. \quad (25)

Here, Ψ_pre is a matrix where each column contains a pre-resampling perturbation, and Ψ_post is a matrix where each column contains a post-resampling perturbation. Supposing the pre-resampling cluster size is denoted by N_pre and the post-resampling cluster size is denoted by N_post, then Ψ_pre is an N_ψ × N_pre matrix and Ψ_post is an N_ψ × N_post matrix. If we denote the ℓ-th member index in S_pre as n_{pre,ℓ}, and likewise for the ℓ-th member index in S_post, we can explicitly write out Ψ_pre and Ψ_post:

\Psi_{\mathrm{pre}} \equiv \begin{bmatrix} \psi^{a\prime}_{n_{\mathrm{pre},1}} & \psi^{a\prime}_{n_{\mathrm{pre},2}} & \cdots & \psi^{a\prime}_{n_{\mathrm{pre},N_{\mathrm{pre}}}} \end{bmatrix},
\qquad
\Psi_{\mathrm{post}} \equiv \begin{bmatrix} \psi^{a*}_{n_{\mathrm{post},1}} & \psi^{a*}_{n_{\mathrm{post},2}} & \cdots & \psi^{a*}_{n_{\mathrm{post},N_{\mathrm{post}}}} \end{bmatrix}.

Finally, T is an N_pre × N_post matrix containing all of the T_{n,n*} values [i.e., element (n, n*) of T is equal to T_{n,n*}].
T should be constructed such that the post-resampling perturbations have a mean of zero and a covariance matrix equal to that of the pre-resampling perturbations. As discussed in Chan, Anderson, and Chen (2020), there are an infinite number of possible T's that satisfy these two conditions. Following the discussions and heuristic arguments in Chan, Anderson, and Chen (2020), we chose to use

T \equiv \begin{bmatrix} k I_{N_{\mathrm{pre}} - N^*_{\mathrm{new}}} & 0_{(N_{\mathrm{pre}} - N^*_{\mathrm{new}}) \times N^*_{\mathrm{new}}} & 0_{(N_{\mathrm{pre}} - N^*_{\mathrm{new}}) \times N_{\mathrm{new}}} \\ 0_{N^*_{\mathrm{new}} \times (N_{\mathrm{pre}} - N^*_{\mathrm{new}})} & I_{N^*_{\mathrm{new}}} & E \end{bmatrix}, \quad (26)

where

N^*_{\mathrm{new}} \equiv \min\left( N_{\mathrm{pre}}, N_{\mathrm{new}} - 1 \right), \quad (27)

and N_new ≡ N_post − N_pre is the number of members added to the expanding cluster. Furthermore, for arbitrary integers η and µ, I_η is an η × η identity matrix and 0_{η×µ} is an η × µ matrix of zeros. k is the following scalar inflation factor:

k \equiv \sqrt{ \frac{N_{\mathrm{post}} - 1}{N_{\mathrm{pre}} - 1} }. \quad (28)

The matrix E in Eq. (26) is an N*_new × N_new matrix that will be defined shortly. Since N*_new < N_new [see Eq. (27)], E is a rectangular matrix with more columns than rows. Note that whenever N_new > N_pre, the kI_{N_pre − N*_new} component vanishes from T. Furthermore, whenever N_new = 1, the I_{N*_new} and E components vanish from T. Our choice of E is nearly identical to that of Chan, Anderson, and Chen (2020):

E \equiv \frac{k - 1}{N_{\mathrm{new}}} \, 1_{N^*_{\mathrm{new}} \times N_{\mathrm{new}}} + W. \quad (29)

Here, 1_{N*_new × N_new} denotes an N*_new × N_new matrix whose elements are all set to unity. Furthermore, W is an N*_new × N_new matrix with zero row sums that satisfies

W W^{\mathrm{T}} = \left( k^2 - 1 \right) I_{N^*_{\mathrm{new}}} - \frac{(k - 1)^2}{N_{\mathrm{new}}} \, 1_{N^*_{\mathrm{new}} \times N^*_{\mathrm{new}}}, \quad (30)

which ensures that T satisfies the zero-mean and covariance-preservation conditions. Supposing that Chol(S) denotes the Cholesky decomposition of an arbitrary symmetric matrix S, and following appendix B of Chan, Anderson, and Chen (2020), W is constructed by combining Chol(·) of the right-hand side of Eq. (30) with a fixed set of orthonormal zero-sum row vectors. The only difference between the current formulation of E and that of Chan, Anderson, and Chen (2020) lies in the W matrix. In Chan, Anderson, and Chen (2020), W is created from vectors of random white noise. For the ease of parallelization and to ensure replicability (i.e., reruns of the BGEnKF should give the same result), we replaced that stochastic W generation procedure with a deterministic one [i.e., Eq. (30)].
As discussed in Chan, Anderson, and Chen (2020), the resampled perturbations generated by the T defined in Eq. (26) have the property of preserving the pre-resampling perturbations (up to an inflation factor). More specifically, the first N_pre − N*_new resampled perturbations are inflated versions of the first N_pre − N*_new pre-resampling perturbations. The next N*_new resampled perturbations are copies of N*_new of the pre-resampling perturbations. Finally, the remaining N_new resampled perturbations are linear combinations of the copied perturbations.
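The two conditions on T can be checked numerically. The sketch below (ours) follows the Eq. (25)–(26) reconstruction above; under those requirements, constant row sums of T and T T^T = [(N_post − 1)/(N_pre − 1)] I are what the zero-mean and covariance-preservation conditions reduce to:

```python
import numpy as np

def resample_expanding_cluster(perts_pre, T):
    """Apply Eq. (25): post-resampling perturbations are linear
    combinations (columns of T) of the pre-resampling ones.

    perts_pre : (N_psi, N_pre) matrix whose columns are perturbations
    T         : (N_pre, N_post) combination matrix
    """
    return perts_pre @ T

def check_T(T, n_pre, n_post):
    """Verify the two requirements stated in the text."""
    row_sums = T @ np.ones(n_post)
    # Constant row sums keep the resampled perturbations zero-mean,
    # because the pre-resampling perturbations already sum to zero.
    assert np.allclose(row_sums, row_sums.mean())
    # This keeps the sample covariance (with its 1/(N-1) factor) unchanged.
    target = (n_post - 1) / (n_pre - 1) * np.eye(n_pre)
    assert np.allclose(T @ T.T, target)
```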
Outline of three-stage BGEnKF update procedure to assimilate an observation

The outline of the three-stage BGEnKF procedure is as follows. Note that this outline assumes that the members have already been sorted into the clear and cloudy clusters (see the last paragraph of Text S2 for how members are sorted into the two clusters).
1. Evaluate Eq. (21) to determine the targeted cluster sizes after assimilating the observation.
2. If N^a_clr < N^f_clr, the clear cluster will be considered the shrinking cluster.
3. If N^a_cld < N^f_cld, the cloudy cluster will be considered the shrinking cluster.
4. If no shrinking cluster has been identified, terminate the current stage.
5. Compute the number of members to be deleted (N_del) via Eq. (22).
6. Compute the current mean state of the shrinking cluster.
7. Delete the N_del members whose forecast-simulated observation perturbations are the smallest in magnitude within the shrinking cluster.
8. Compute the mean state of the remaining members in the shrinking cluster.
9. Subtract the mean computed in step 8 from the mean computed in step 6.
10. Add the difference computed in step 9 to each of the remaining members in the shrinking cluster to recenter said members on the pre-deletion shrinking cluster mean state.
To implement this algorithm with parallelization on the PSU-EnKF system, we employed the low-latency computing cluster strategy proposed by Anderson and Collins (2007). Specifically, every process will receive a sub-domain's worth of model state variables, an entire domain of observation and simulated observation values, and an entire domain of ξ values. To assimilate an observation, each process will then update its sub-domain of model state variables, all of its simulated observations, and all of its ξ values. As such, no inter-process communications are needed within the serial assimilation loop.

6. Text S6: On generalizing the BGEnKF algorithm to handle more clusters

The BGEnKF algorithm can be generalized to handle an arbitrary number of ensemble clusters (e.g., a three-cluster GMM-EnKF). We did not use more than two clusters in this study because this study is a first approach to testing a GMM-EnKF with a realistic weather model. Furthermore, using more clusters means that each cluster will contain fewer members. With smaller cluster sizes, the deleterious impacts of sampling errors on each cluster's sample statistics are likely stronger. Considering the small ensemble size used in this first-approach study (50 members), we opted to use two clusters for now.
To generalize the BGEnKF to handle N_c clusters, only a few modifications are needed: 1) the ensemble clustering method needs to be adjusted to sort the ensemble into the N_c clusters, and 2) a slightly different method would be needed to infer the posterior cluster sizes [Eq. (21)]. The latter modification is necessary because using Eq. (21) with more than two clusters can cause the total number of ensemble members to change. This change arises from the use of the rounding function. For instance, suppose we have 3 clusters with equal posterior weights (0.333333 each) and the ensemble size is 10. Using Eq. (21) will result in 3 members in each cluster, or 9 members in total. A different approach to converting the non-integer weights into integer cluster sizes is thus necessary for N_c > 2; one standard apportionment option is sketched after the figure caption below.

Figure S2: Workflow of the BGEnKF module implemented in the PSU DA system. "Obs" stands for "observations" and N_obs stands for the total number of observations. See the text for the definitions of the extended state vector ψ [Eq. (3)], the list of heuristic checks used to select between the EnKF and BGEnKF (main text section 2.5), and a description of the BGEnKF update procedure (Text S4). The three-stage BGEnKF update procedure is illustrated in Figure 1 of the main text.
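One standard way to convert N_c posterior weights into integer cluster sizes that sum exactly to N_E is largest-remainder apportionment. The sketch below is our suggestion, not necessarily the method the authors would adopt:

```python
import numpy as np

def weights_to_sizes(w_post, n_e):
    """Convert posterior cluster weights into integer cluster sizes whose
    sum is exactly n_e (largest-remainder apportionment)."""
    ideal = np.asarray(w_post) * n_e        # non-integer target sizes
    sizes = np.floor(ideal).astype(int)
    # Hand leftover members to the clusters with the largest remainders.
    leftover = n_e - sizes.sum()
    for idx in np.argsort(ideal - sizes)[::-1][:leftover]:
        sizes[idx] += 1
    return sizes

# Example from the text: three equal weights and N_E = 10 gives sizes
# summing to 10 (a permutation of [3, 3, 4]) instead of rounding's 9.
print(weights_to_sizes([1 / 3, 1 / 3, 1 / 3], 10))
```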